forked from git-for-windows/git
    
        
ci: backport fix for osx-* 6h timeouts #700
        
          
Merged: derrickstolee merged 2 commits into microsoft:vfs-2.47.0 from gitgitgadget:jk/fsmonitor-event-listener-race-fix on Oct 22, 2024
      
    
                
              
            
      
        
+98 −18
        
        
          
        
      
    
  
    
  
  
    
To start an async ipc server, you call ipc_server_run_async(). That
initializes the ipc_server_data object, and starts all of the threads
running, which may immediately start serving clients.

This can create some awkward timing problems, though. In the fsmonitor
daemon (the sole user of the simple-ipc system), we want to create the
ipc server early in the process, which means we may start serving
clients before the rest of the daemon is fully initialized.

To solve this, let's break run_async() into two parts: an initialization
which allocates all data and spawns the threads (without letting them
run), and a start function which actually lets them begin work. Since we
have two simple-ipc implementations, we have to handle this twice:

  - in ipc-unix-socket.c, we have a central listener thread which hands
    connections off to worker threads using a work_available mutex. We
    can hold that mutex after init, and release it when we're ready to
    start.

    We do need an extra "started" flag so that we know whether the main
    thread is holding the mutex or not (e.g., if we prematurely stop the
    server, we want to make sure all of the worker threads are released
    to hear about the shutdown).

  - in ipc-win32.c, we don't have a central mutex. So we'll introduce a
    new startup_barrier mutex, which we'll similarly hold until we're
    ready to let the threads proceed.

    We again need a "started" flag here to make sure that we release the
    barrier mutex when shutting down, so that the sub-threads can
    proceed to the finish.

I've renamed the run_async() function to init_async() to make sure we
catch all callers, since they'll now need to call the matching
start_async().

We could leave run_async() as a wrapper that does both, but there's not
much point. There are only two callers, one of which is fsmonitor, which
will want to actually do work between the two calls. And the other is
just a test-tool wrapper.

For now I've added the start_async() calls in fsmonitor where they would
otherwise have happened, so there should be no behavior change with this
patch.

Signed-off-by: Jeff King <[email protected]>
Acked-by: Koji Nakamaru <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
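
The init/start split described in the commit message above can be sketched with plain pthreads. This is only an illustration under stated assumptions, not the code from ipc-unix-socket.c or ipc-win32.c: the `demo_*` names, the struct layout, and the fixed worker count are all hypothetical. The main thread takes a barrier mutex during init, so the freshly spawned workers park on it; start releases it, and stop releases it too if start was never called, so that parked workers can notice the shutdown.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define DEMO_NR_WORKERS 4

/* Hypothetical stand-in for ipc_server_data; not Git's actual struct. */
struct demo_server {
	pthread_mutex_t startup_barrier; /* held by the main thread until start */
	pthread_mutex_t state_lock;      /* protects shutdown_requested, nr_served */
	int shutdown_requested;
	int nr_served;
	int started;                     /* has the main thread released the barrier? */
	int nr_workers;                  /* threads actually spawned */
	pthread_t workers[DEMO_NR_WORKERS];
};

static void *demo_worker_main(void *arg)
{
	struct demo_server *s = arg;

	/* Park here until the main thread drops the barrier. */
	pthread_mutex_lock(&s->startup_barrier);
	pthread_mutex_unlock(&s->startup_barrier);

	pthread_mutex_lock(&s->state_lock);
	if (!s->shutdown_requested)
		s->nr_served++; /* stands in for serving one client */
	pthread_mutex_unlock(&s->state_lock);
	return NULL;
}

/* Let the parked workers run. Calling it twice is harmless. */
static void demo_server_start_async(struct demo_server *s)
{
	if (s->started)
		return;
	s->started = 1;
	pthread_mutex_unlock(&s->startup_barrier);
}

/* Request shutdown; release the barrier if we never started. */
static void demo_server_stop_async(struct demo_server *s)
{
	pthread_mutex_lock(&s->state_lock);
	s->shutdown_requested = 1;
	pthread_mutex_unlock(&s->state_lock);
	demo_server_start_async(s);
}

/* Join all workers and return how many clients were "served". */
static int demo_server_await(struct demo_server *s)
{
	int i;

	/* Joining while the barrier is still held would never return. */
	assert(s->started);
	for (i = 0; i < s->nr_workers; i++)
		pthread_join(s->workers[i], NULL);
	pthread_mutex_destroy(&s->startup_barrier);
	pthread_mutex_destroy(&s->state_lock);
	return s->nr_served;
}

/* Allocate state and spawn the workers, but keep them parked. */
static int demo_server_init_async(struct demo_server *s)
{
	memset(s, 0, sizeof(*s));
	pthread_mutex_init(&s->startup_barrier, NULL);
	pthread_mutex_init(&s->state_lock, NULL);
	pthread_mutex_lock(&s->startup_barrier);

	while (s->nr_workers < DEMO_NR_WORKERS) {
		if (pthread_create(&s->workers[s->nr_workers], NULL,
				   demo_worker_main, s)) {
			demo_server_stop_async(s);
			demo_server_await(s);
			return -1;
		}
		s->nr_workers++;
	}
	return 0;
}
```

A caller would run `demo_server_init_async()`, finish its own setup, and only then call `demo_server_start_async()`. In this sketch the "started" flag is touched only by the thread that owns the barrier, which mirrors the purpose the commit message gives for it: knowing whether the main thread still holds the mutex.
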

There's a racy hang in fsmonitor on macOS that we sometimes see in CI.
When we serve a client, what's supposed to happen is:

  1. The client thread calls with_lock__wait_for_cookie() in which we
     create a cookie file and then wait for a pthread_cond event

  2. The filesystem event listener sees the cookie file creation, does
     some internal book-keeping, and then triggers the pthread_cond.

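
The two steps above follow the usual predicate-guarded condition-variable pattern. The sketch below is a simplified model, not fsmonitor's code: the `demo_*` names and the `created`/`seen` flags are hypothetical, and a flag stands in for the cookie file. It assumes the listener is already running when the cookie is created. If the listener never learned about the cookie, `seen` would never be set and the client would wait forever.

```c
#include <assert.h>
#include <pthread.h>

struct demo_cookie {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int created; /* the client has created the cookie */
	int seen;    /* the listener has observed it */
};

/* Step 2: the listener notices the cookie, then wakes the client. */
static void *demo_listener_main(void *arg)
{
	struct demo_cookie *c = arg;

	pthread_mutex_lock(&c->lock);
	while (!c->created)
		pthread_cond_wait(&c->cond, &c->lock);
	c->seen = 1; /* stands in for the internal book-keeping */
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
	return NULL;
}

/* Step 1: create the cookie, then block until the listener reports it. */
static void demo_wait_for_cookie(struct demo_cookie *c)
{
	pthread_mutex_lock(&c->lock);
	c->created = 1;
	pthread_cond_broadcast(&c->cond);
	while (!c->seen)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Run one handshake; returns 1 once the client has been woken. */
static int demo_cookie_roundtrip(void)
{
	struct demo_cookie c = { .created = 0, .seen = 0 };
	pthread_t listener;
	int err, seen;

	err = pthread_mutex_init(&c.lock, NULL);
	assert(!err);
	err = pthread_cond_init(&c.cond, NULL);
	assert(!err);

	if (pthread_create(&listener, NULL, demo_listener_main, &c))
		return -1;
	demo_wait_for_cookie(&c);
	pthread_join(listener, NULL);

	seen = c.seen;
	pthread_cond_destroy(&c.cond);
	pthread_mutex_destroy(&c.lock);
	return seen;
}
```
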
But there's a problem: we start the listener that accepts client threads
before we start the fs event thread. So it's possible for us to accept a
client which creates the cookie file and starts waiting before the fs
event thread is initialized, and we miss those filesystem events
entirely. That leaves the client thread hanging forever.

In CI, the symptom is that t9210 (which is testing scalar, which always
enables fsmonitor under the hood) may hang forever in "scalar clone". It
is waiting on "git fetch" which is waiting on the fsmonitor daemon.

The race happens more frequently under load, but you can trigger it
predictably with a sleep like this, which delays the start of the fs
event thread:

  --- a/compat/fsmonitor/fsm-listen-darwin.c
  +++ b/compat/fsmonitor/fsm-listen-darwin.c
  @@ -510,6 +510,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
          FSEventStreamSetDispatchQueue(data->stream, data->dq);
          data->stream_scheduled = 1;
  +       sleep(1);
          if (!FSEventStreamStart(data->stream)) {
                  error(_("Failed to start the FSEventStream"));
                  goto force_error_stop_without_loop;

One solution might be to reverse the order of initialization: start the
fs event thread before we start the thread listening for clients. But
the fsmonitor code explicitly does it in the opposite direction. The fs
event thread wants to refer to the ipc_server_data struct, so we need it
to be initialized first.

A further complication is that we need a signal from the fs event thread
that it is actually ready and listening. And those details happen within
backend-specific fsmonitor code, whereas the initialization is in the
shared code.

So instead, let's use the ipc_server init/start split added in the
previous commit. The generic fsmonitor code will init the ipc_server but
_not_ start it, leaving that to the backend specific code, which now
needs to call ipc_server_start_async() at the right time.

For macOS, that is right after we start the FSEventStream that you can
see in the diff above.

It's not clear to me if Windows suffers from the same problem (and we
simply don't trigger it in CI), or if it is immune. Regardless, the
obvious place to start accepting clients there is right after we've
established the ReadDirectoryChanges watch.

This makes the hangs go away in our macOS CI environment, even when
compiled with the sleep() above.

Helped-by: Koji Nakamaru <[email protected]>
Signed-off-by: Jeff King <[email protected]>
Acked-by: Koji Nakamaru <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
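
The ordering argument in this commit message can be checked with a tiny single-threaded model. It is a sketch under loud assumptions: there are no real threads, sockets, or FSEvents here, the `demo_*` names are hypothetical, and a client that cannot connect is assumed to simply try again later. The model only tracks which of the two components is live when a client arrives at the worst possible moment, just before the backend starts its event stream.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_result {
	DEMO_NOT_ACCEPTING, /* ipc server not started: client cannot connect yet */
	DEMO_SERVED,        /* listener saw the cookie and woke the client */
	DEMO_WOULD_HANG     /* cookie created before the listener was running */
};

struct demo_daemon {
	bool ipc_started;    /* ipc workers may serve clients */
	bool listener_ready; /* the fs event stream has been started */
};

/* What happens to a client that shows up right now? */
static enum demo_result demo_client_arrives(const struct demo_daemon *d)
{
	if (!d->ipc_started)
		return DEMO_NOT_ACCEPTING;
	return d->listener_ready ? DEMO_SERVED : DEMO_WOULD_HANG;
}

/*
 * Walk through daemon startup and report the fate of a client that
 * arrives after ipc init but before the fs event stream is running.
 */
static enum demo_result demo_worst_case(bool start_ipc_in_backend)
{
	struct demo_daemon d = { false, false };
	enum demo_result r;

	/* Generic code: init the ipc server. The old flow also started it here. */
	if (!start_ipc_in_backend)
		d.ipc_started = true;

	/* The client arrives before the backend has started its event stream. */
	r = demo_client_arrives(&d);

	/* Backend: start the event stream, then (new flow) start the server. */
	d.listener_ready = true;
	if (start_ipc_in_backend)
		d.ipc_started = true;
	assert(d.ipc_started && d.listener_ready);

	/* A client that could not connect earlier tries again now. */
	if (r == DEMO_NOT_ACCEPTING)
		r = demo_client_arrives(&d);
	return r;
}
```

With the old flow (`demo_worst_case(false)`) the early client lands in the would-hang state. With the backend-driven start (`demo_worst_case(true)`) the same client is turned away until the listener is ready and is then served.
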
    
            
derrickstolee approved these changes on Oct 22, 2024
                
            
            
          
          
Thanks for getting this into all the right places.
  
  
    
  
    
For quite a while now we have been haunted by those evil timeouts in some of the slowest CI jobs. There is a fix in upstream Git now, already on their `master` branch, but still unreleased. I have integrated a backport in Git for Windows v2.47.0(2) via git-for-windows#5221.

Let's integrate this into `vfs-2.47.0` separately since I do not plan on merging Git for Windows' `main` branch into `vfs-2.47.0` any time soon (basically, I integrated those recent changes only because I had to have a version out that fixes git-for-windows#5199 but v2.47.0.vfs.0.2 already fixed that, so there is no urgency).